Nearly every aspect of soil survey involves the question: “Is X more similar to Y or to Z?” The quantification of similarity within a collection of horizons, pedons, components, map units, or even landscapes represents an exciting new way to enhance the precision and accuracy of the day-to-day work of soil scientists. After completing this module, you should be able to quantitatively organize objects based on measured or observed characteristics in a consistent and repeatable manner. Perhaps you will find a solution to the long-standing “similar or dissimilar” question.
Most of the examples featured in this whirlwind tour are based on soil data from McGahan, D.G., Southard, R.J, Claassen, V.P. 2009. Plant-available calcium varies widely in soils on serpentinite landscapes. Soil Sci. Soc. Am. J. 73: 2087-2095. These data are available in the dataset “sp4” that is built into aqp package for R.
There are shelves of books and thousands of academic articles describing the theory and applications of “clustering” and “ordination” methods. This body of knowledge is commonly described as the field of numerical taxonomy (Sneath and Sokal 1973). Central to this field is the quantification of similarity among “individuals” based on a relevant set of “characteristics.” Individuals are typically described as rows of data with a single characteristic per column, together referred to as a data matrix. For example:
| name | clay | sand | Mg | Ca | CEC_7 |
|---|---|---|---|---|---|
| A | 21 | 46 | 25.7 | 9.0 | 23.0 |
| ABt | 27 | 42 | 23.7 | 5.6 | 21.4 |
| Bt1 | 32 | 40 | 23.2 | 1.9 | 23.7 |
| Bt2 | 55 | 27 | 44.3 | 0.3 | 43.0 |
Quantitative measures of similarity are more conveniently expressed as distance, or dissimilarity; in part because of convention and in part because of computational efficiency. In the simplest case, dissimilarity can be computed as the shortest distance between individuals in property-space. Another name for the shortest linear distance between points is the Euclidean distance. Evaluated in two dimensions (between individuals \(p\) and \(q\)), the Euclidean distance is calculated as follows:
\[D(p,q) = \sqrt{(p_{1} - q_{1})^{2} + (p_{2} - q_{2})^{2}}\]
where \(p_{1}\) is the 1st characteristic (or dimension) of individual \(p\).
There are many other ways to define “distance” (e.g. distance metrics), but they will be covered later.
Using the sand and clay percentages from the data above, dissimilarity is represented as the length of the line connecting any two individuals in property space.
The following is a matrix of all pair-wise distances (the distance matrix):
| A | ABt | Bt1 | Bt2 | |
|---|---|---|---|---|
| A | 0.0 | 7.2 | 12.5 | 38.9 |
| ABt | 7.2 | 0.0 | 5.4 | 31.8 |
| Bt1 | 12.5 | 5.4 | 0.0 | 26.4 |
| Bt2 | 38.9 | 31.8 | 26.4 | 0.0 |
Note that this is the full form of the distance matrix. In this form, zeros are on the diagonal (i.e. the distance between an individual and itself is zero) and the upper and lower “triangles” are symmetric. The lower triangle is commonly used by most algorithms to encode pair-wise distances.
| A | ABt | Bt1 | |
|---|---|---|---|
| ABt | 7.2 | ||
| Bt1 | 12.5 | 5.4 | |
| Bt2 | 38.9 | 31.8 | 26.4 |
Interpretation of the matrix is simple: Individual “A” is more like “ABt” than like “Bt1.” It is important to note that quantification of dissimilarity (distance) among individuals is always relative: “X is more like Y, as compared to Z.”
Euclidean distance doesn’t make much sense if the characteristics do not share a common unit of measure or range of values. Nor is it relevant when some characteristics are categorical and some are continuous. For example, distances are distorted if you compare clay (%) and exchangeable Ca (cmol/kg).
In this example, exchangeable Ca contributes less to the distance between individuals than clay content, effectively down-weighting the importance of the exchangeable Ca. Typically, characteristics are given equal weight (Sneath and Sokal 1973); however, weighting is much simpler to apply after standardization.
Standardization of the data matrix solves the problem of unequal ranges or units of measure, typically by subtraction of the mean and division by standard deviation (z-score transformation).
\[x_{std} = \frac{x - mean(x)}{sd(x)}\]
There are several other standardization methods covered later. The new data matrix looks like the following:
| name | clay | sand | Mg | Ca | CEC_7 |
|---|---|---|---|---|---|
| A | -0.86 | 0.88 | -0.35 | 1.23 | -0.47 |
| ABt | -0.45 | 0.40 | -0.55 | 0.36 | -0.63 |
| Bt1 | -0.12 | 0.15 | -0.60 | -0.59 | -0.40 |
| Bt2 | 1.43 | -1.43 | 1.49 | -1.00 | 1.49 |
Using the standardized data matrix, distances computed in the property space of clay and exchangeable calcium are unbiased by the unique central tendency or spread of each character.
Rarely can the question of “dissimilarity” be answered with only two characteristics (dimensions). Euclidean distance, however, can be extended to an arbitrary number of \(n\) dimensions.
\[D(p,q) = \sqrt{ \sum_{i=1}^{n}{(p_{i} - q_{i})^{2}} }\]
In the equation above, \(i\) is one of \(n\) total characteristics. Imagining what distance “looks like” is difficult if there are more than three dimensions. Instead, examine the distance matrix calculated using all five characteristics.
Rescaling to the interval {0,1}.
You can now begin to describe dissimilarity between individuals using an arbitrary number of (relevant) characteristics. You can make statements like “The A horizon is roughly 2x more similar to the ABt horizon than it is to the Bt horizon.” Although this is a trivial example, the utility of generalizing these methods to soil survey operations should be obvious.
Missing data are a fact of life. Soil scientists are quite familiar with missing lab data (“Why didn’t we request optical grain counts?”) or missing essential NASIS pedon data elements, such as horizon bottom depth, estimated clay fraction, or pH. Nearly all of the methods described in this document are very sensitive to missing data. In other words, they won’t work! Following are a couple of possible solutions:
Dendrograms are a convenient approximation of pair-wise distances between individuals (after application of hierarchical grouping criteria; more on this later). Dissimilarity between branches is proportional to the level at which branches merge: branching at higher levels (relative to the root of the tree) suggests greater dissimilarity; branching at lower levels suggests greater similarity. Consider the previous example in which distance between individuals was defined in terms of sand and clay percentages.
Interpretation is simple. Euclidean distance in property-space is directly proportional to branching height in the corresponding dendrogram. Visualizing the geometry of pair-wise distances in more than three dimensions is difficult. A dendrogram, however, can conveniently summarize a distance matrix created from an arbitrary number of characteristics (dimensions). It is important to note that some information about pair-wise distances is lost or distorted in the dendrogram. Distortion is least near the terminal “leaves” of the dendrogram. This phenomena is analogous to the distortion generated by a map projection. It is impossible to flatten a higher-dimensional entity to a lower-dimensional form without causing distortion.
The branches of a dendrogram can be rotated like a mobile, so that the final ordering of the terminal “leaves” approximates an alternate ranking. For example, branches of the following dendrogram (right-hand side) have been rotated to approximate the expected hydrologic gradient from summit to toeslope.
Cluster analysis is a massive topic that deals with the seemingly simple task of finding useful groups within a dataset. This topic and the methods used are also referred to as “unsupervised classification” in the fields of remote sensing and GIS. All of the available algorithms will find groups in a given dataset; however, it is up to the subject expert to determine the following:
Note that the widespread use of color in the following examples is not for aesthetic purposes. Colors are convenient for tri-variate data-spaces because you can visually integrate the information into a self-consistent set of classes.
Hierarchical clustering is useful when a full distance matrix is available and the optimal number of clusters is not yet known. This form of clustering creates a data structure that can encode “grouping” information from one cluster to as many clusters as there are individuals. The expert must determine the optimal place to “cut the tree” and generate a fixed set of clusters. The results from a hierarchical clustering operation are nearly always presented as a dendrogram.
There are two main types of hierarchical clustering.
Both methods are strongly influenced by the choice of standardization method and distance metric. Both methods require a full, pair-wise distance matrix as input. This requirement can limit the use of hierarchical clustering to datasets that can be fit into memory.
The agglomerative methods also depend on the choice of linkage criterion. Some of these criteria include:
agnes() and recommended by (Kaufman and Rousseeuw 2005),See (Kaufman and Rousseeuw 2005), (Arkley 1976), (Legendre and Legendre 1998), and agnes() manual page for a detailed description of these linkage criteria. Selection of linkage criteria can be quantitatively evaluated using the cophenetic correlation; and index of how well a dendrogram preserves the original pair-wise distances. This thread on StackExchange has a nice summary of various linkage criteria. More on this later.
Centroid and medoid cluster analyses are commonly referred to as k-means-style analysis. “K-means,” however, is just one of many possible clustering algorithms that partition property-space into a fixed number of groups. These type of algorithms can be applied to very large datasets because they do not rely on the distance matrix. Instead, they are based on an iterative shuffling of group “centroids” until some criterion is minimized, for example, the mean variance within groups.
This section describes three (out of many) of the most important partitioning-type algorithms.
All of these methods are sensitive to the type of standardization applied to the characteristics. These methods rely on iterative minimization of one or more criteria; therefore, each clustering “run” may generate slightly different output. Most implementations re-run the algorithm until it stabilizes.
Humans are generally quite good at extracting spatial patterns, almost instantly, from two dimensional fields: faces, written language, etc. Sadly, this ability does not extend beyond two or three dimensions. The term ordination refers to a suite of methods that project coordinates in a high-dimensional space into suitable coordinates in a low-dimensional (reduced) space. Map projections are a simple form of ordination: coordinates from the curved surface of the Earth are projected to a two-dimensional plane. As with any projection, there are assumptions, limitations, and distortions. Carl Sagan gives a beautiful demonstration of this concept using shadows, in this excerpt from Cosmos. Here is another excellent demonstration based on handwriting recognition.
Hole and Hironaka (1960) were some of the first pedologists to apply ordination methods to soils data. The main figure from their classic paper was hand-drawn, based on a physical model (constructed from wooden dowels and spheres!) of the ordination.Of the many possible ordination methods, there are two major types to keep in mind:
Constrained ordination: coordinates in the reduced space are subject to some kind of constraint (more rigid, simple interpretation). Principal component analysis (PCA) is one of the simplest and most widely used ordination methods. The reduced space (“principal components”) are defined by linear combinations of characteristics.
Unconstrained ordination: coordinates in the reduced space are not subject to any kind of constraint (more flexible, less interpretable). “Non-metric” multidimensional scaling (nMDS) attempts to generate a reduced space that minimizes distortion in proportional similarity; i.e., similar individuals are near each other in the reduced space, dissimilar individuals are farther apart. The “non-metric” adjective implies that exact distances are not preserved.
See (Legendre and Legendre 1998) for a comprehensive listing of methods, associated theory, and ecological applications.
The following example is based on a data matrix containing lab measurements of clay fraction, sand fraction, exchangeable Ca, exchangeable Mg, and CEC measured by NH4-Ac at pH 7.
| name | clay | sand | Mg | Ca | CEC_7 |
|---|---|---|---|---|---|
| A | -0.41 | 0.21 | 0.06 | 0.44 | -0.23 |
| ABt | 0.04 | -0.07 | -0.06 | -0.13 | -0.38 |
| Bt1 | 0.41 | -0.21 | -0.09 | -0.74 | -0.16 |
| … | … | … | … | … | … |
Note that distances between individuals, within clusters, and between clusters is more apparent the nMDS representation of the distance matrix. Similar information is embedded in the dendrogram but it is not as intuitive.
This is a complex topic, described in a supplemental set of slides.
If you want for more detailed information, see this relevant paper.
This is the fun part.
Install R packages as needed. Open a new R script file to use as you follow along.
# load libraries
library(aqp)
library(soilDB)
library(sharpshootR)
library(cluster)
library(ape)
library(RColorBrewer)
library(vegan)
library(MASS)
library(colorspace)
library(viridis)
Most of the examples used in the following exercises come from these sources:
aqp and soilDB packages (“sp4,” “gopheridge,” and “loafercreek”).fetchNASIS(): pedon data from the local NASIS selected set.fetchKSSL(): lab characterization data from the SoilWeb snapshot.fetchOSD(): basic morphologic and taxonomic data from the SoilWeb snapshot.SDA_query(): live SSURGO spatial and tabular data from Soil Data AccessIn most cases, you can edit the examples and swap-in just about any data that are in a SoilProfileCollection object. For example, pedons from your local NASIS selected set can be loaded with fetchNASIS().
Tinker with some SoilProfileCollection objects.
?fetchKSSL) or the SoilProfileCollection tutorial.
The aqp package provides two functions for checking the fraction of missing data within a SoilProfileCollection object. The first function (evalMissingData) generates an index that ranges from 0 (all missing) to 1 (all present) for each profile. This index can be used to subset or rank profiles for further investigation. The second function (missingDataGrid) creates a visualization of the fraction of data missing within each horizon. Both functions can optionally filter-out horizons that don’t typically have data, for example Cr, Cd, and R horizons.
The following examples are based on the gopheridge sample dataset.
evalMissingData
# example data
data("gopheridge", package = "soilDB")
# compute data completeness
gopheridge$data.complete <- evalMissingData(gopheridge, vars = c('clay', 'sand', 'phfield'), name = 'hzname', p = 'Cr|R|Cd')
# check range in missing data
summary(gopheridge$data.complete)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2143 0.4672 1.0000 1.0000
# rank
new.order <- order(gopheridge$data.complete)
# plot along data completeness ranking
par(mar=c(3,0,1,1))
plot(gopheridge, plot.order=new.order, print.id=FALSE, name='')
# add axis, note re-ordering of axis labels
axis(side=1, at=1:length(gopheridge), labels = round(gopheridge$data.complete[new.order], 2),
line=-2, cex.axis=0.65, las=2)
title('Gopheridge pedons sorted according to data completeness (clay, sand, pH)')
missingDataGrid
# view missing data as a fraction
res <- missingDataGrid(gopheridge, max_depth=100, vars=c('clay', 'sand', 'phfield'), filter.column='hzname', filter.regex = 'Cr|R|Cd', main='Fraction of missing data (clay, sand, pH)', cols = viridis(10))
# check results
head(res)
## peiid clay sand phfield
## 1 242808 29 29 0
## 2 252851 29 29 29
## 3 268791 0 0 0
## 4 268793 33 33 33
## 5 268794 33 33 33
## 6 268795 0 33 0
For now, extract those profiles that have a complete set of field-described clay, sand, or pH values for later use.
# be sure to read the manual page for this function
gopheridge.complete <- subsetProfiles(gopheridge, s = "data.complete > 0.99")
# another way
idx <- which(gopheridge$data.complete > 0.99)
gopheridge.complete <- gopheridge[idx, ]
# looks good
par(mar=c(0,0,3,1))
plot(gopheridge.complete, color='clay', id.style='side', label='pedon_id')
The following three functions are essential to the creation of a distance matrix:
dist(): This function is in base R, is simple and fast, and has a limited number of distance metrics.
daisy(): This function is in cluster package, has a better selection of distance metrics, and has simple standardization. Much more convenient than dist().vegdist(): This function is in vegan package, has many distance metrics, and is primarily designed for species composition data.The following is a short demonstration:
# get some example data from the aqp package
data('sp4', package = 'aqp')
# subset select rows and columns
sp4 <- sp4[1:4, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
row.names(sp4) <- sp4$name
# compare distance functions
# Euclidean distance, no standardization
round(dist(sp4[, -1], method = 'euclidean'))
## A ABt Bt1
## ABt 8
## Bt1 15 7
## Bt2 48 44 39
# Euclidean distance, standardization
round(daisy(sp4[, -1], stand = TRUE, metric = 'euclidean'), 2)
## Dissimilarities :
## A ABt Bt1
## ABt 1.45
## Bt1 2.73 1.36
## Bt2 6.45 5.65 4.91
##
## Metric : euclidean
## Number of objects : 4
# Gower's generalized distance metric (includes standardization)
round(vegdist(sp4[, -1], method = 'gower'), 2)
## A ABt Bt1
## ABt 0.19
## Bt1 0.32 0.16
## Bt2 0.96 0.84 0.69
The following example is excerpted from “A Novel Display of Categorical Pedon Data”. This example illustrates an application of clustering binary data (presence or absence of a diagnostic feature). Internally, the diagnosticPropertyPlot() function uses the daisy() function to compute pair-wise distances using the “general dissimilarity coefficient” of Gower (Gower 1971). A concise summary of this distance metric is in (Kaufman and Rousseeuw 2005).
# load some example NASIS data
data(loafercreek, package='soilDB')
# cut-down to a subset, first 20 pedons
loafercreek <- loafercreek[1:20, ]
# get depth class
sdc <- getSoilDepthClass(loafercreek)
site(loafercreek) <- sdc
# diagnostic properties to consider, no need to convert to factors
v <- c('lithic.contact', 'paralithic.contact', 'argillic.horizon',
'cambic.horizon', 'ochric.epipedon', 'mollic.epipedon', 'very.shallow',
'shallow', 'mod.deep', 'deep', 'very.deep')
# do the analysis and save the results to object 'x'
x <- diagnosticPropertyPlot(loafercreek, v, k=5, grid.label='bedrckkind', dend.label = 'taxonname')
If you are wondering what is in the object x, the str() function or manual page (?diagnosticPropertyPlot) can help.
The go-to functions for hierarchical clustering are as follows:
hclust(): This function is agglomerative, is in base R, requires a distance matrix, and implements most of the commonly used linkage criteria.
agnes(): This function is agglomerative, is in cluster package, can perform standardization and distance calculations, and implements more linkage criteria.diana(): This function is divisive, is in cluster package, can perform standardization and distance calculations.hclustThe hclust() function and resulting hclust-class objects are simple to use, but limited.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
# distance matrix
d <- daisy(sp4[, -1], metric = 'euclidean', stand = TRUE)
# hierachical clustering with base function hclust
sp4.h <- hclust(d, method = 'ward.D')
sp4.h$labels <- sp4$name
# plot with basic plotting method... not many options here
par(mar=c(2,4,2,2))
plot(sp4.h, font=2, cex=0.85)
# ID clusters after cutting tree
rect.hclust(sp4.h, 4)
ape PackageThis example uses a different approach to plotting based on functions and classes from the ape package.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
# distance matrix
d <- daisy(sp4[, -1], metric = 'euclidean', stand = TRUE)
# divising clustering
dd <- diana(d)
# convert to ape class, via hclust class
h <- as.hclust(dd)
h$labels <- sp4$name
p <- as.phylo(h)
# define some nice colors
col.set <- brewer.pal(9, 'Set1')
# cut tree into 4 groups
groups <- cutree(h, 4)
# make color vector based on groups
cols <- col.set[groups]
The plot methods for phylo class objects are quite flexible. Be sure to see the manual page ?plot.phylo.
par(mar=c(1,1,1,1), mfcol=c(2,2))
plot(p, label.offset=0.125, direction='right', font=1, cex=0.85, main='dendrogram')
tiplabels(pch=15, col=cols)
plot(p, type='radial', font=1, cex=0.85, main='radial')
tiplabels(pch=15, col=cols)
plot(p, type='fan', font=1, cex=0.85, main='fan')
tiplabels(pch=15, col=cols)
plot(p, type='unrooted', font=1, cex=0.85, main='unrooted')
tiplabels(pch=15, col=cols)
Re-visiting our sample data from before, develop hierarchical clusterings using several strategies (methods / linkage criteria).
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
# distance matrix
d <- daisy(sp4[, -1], metric = 'euclidean', stand = TRUE)
# hierarchical clustering based on several strategies
# agglomerative
h.avg <- agnes(d, method='average')
h.single <- agnes(d, method='single')
h.complete <- agnes(d, method='complete')
h.ward <- agnes(d, method='ward')
h.flexible <- agnes(d, method='gaverage', par.method = 0.01)
# divisive
h.div <- diana(d)
The correlation between original distance matrix and cophenetic distance matrix is a reasonable index of how faithfully a dendrogram preserves the original pair-wise distances.
# agglomerative hierarchical clustering with various linkage criteria
corr.avg <- cor(d, cophenetic(h.avg))
corr.single <- cor(d, cophenetic(h.single))
corr.complete <- cor(d, cophenetic(h.complete))
corr.ward <- cor(d, cophenetic(h.ward))
corr.flexible <- cor(d, cophenetic(h.flexible))
# divisive hierarchical clustering
corr.div <- cor(d, cophenetic(h.div))
Combine the results into a table for quick comparison. Note that the agnes and diana functions provide an agglomerative / divisive coefficient that can be used to evaluate clustering efficiency (e.g. cluster size and “compactness”). Depending on the application, one metric may be more informative than the other.
res <- data.frame(
method=c('average', 'single', 'complete', 'ward', 'flexible UPGMA', 'divisive'),
cophenetic.correlation=c(corr.avg, corr.single, corr.complete, corr.ward, corr.flexible, corr.div),
agg_div.coefficient=c(h.avg$ac, h.single$ac, h.complete$ac, h.ward$ac, h.flexible$ac, h.div$dc)
)
# re-order
res <- res[order(res$cophenetic.correlation, decreasing = TRUE), ]
| method | cophenetic.correlation | agg_div.coefficient |
|---|---|---|
| average | 0.840 | 0.787 |
| flexible UPGMA | 0.840 | 0.778 |
| single | 0.784 | 0.613 |
| complete | 0.759 | 0.879 |
| ward | 0.756 | 0.894 |
| divisive | 0.703 | 0.874 |
Investigate differences graphically.
par(mar=c(0,0.25,1.5,0.25), mfrow=c(2,3))
plot(as.phylo(as.hclust(h.avg)), label.offset=0.125, direction='right', font=1, cex=0.65, main='Average')
tiplabels(pch=15, col=col.set[cutree(h.avg, 4)])
plot(as.phylo(as.hclust(h.single)), label.offset=0.125, direction='right', font=1, cex=0.65, main='Single')
tiplabels(pch=15, col=col.set[cutree(h.single, 4)])
plot(as.phylo(as.hclust(h.complete)), label.offset=0.125, direction='right', font=1, cex=0.65, main='Complete')
tiplabels(pch=15, col=col.set[cutree(h.complete, 4)])
plot(as.phylo(as.hclust(h.ward)), label.offset=0.125, direction='right', font=1, cex=0.65, main='Ward')
tiplabels(pch=15, col=col.set[cutree(h.ward, 4)])
plot(as.phylo(as.hclust(h.flexible)), label.offset=0.125, direction='right', font=1, cex=0.65, main='Flexible (0.01)')
tiplabels(pch=15, col=col.set[cutree(h.flexible, 4)])
plot(as.phylo(as.hclust(h.div)), label.offset=0.125, direction='right', font=1, cex=0.65, main='Divisive')
tiplabels(pch=15, col=col.set[cutree(h.div, 4)])
TODO: interpret main differences in shape
Test the “flexible UPGMA” method (Belbin, Faith, and Milligan 1992) by iterating over possible values for \(\beta\). Looks like a value near 0.01 would be ideal (e.g. highest cophenetic correlation). Interestingly, this is very close to the cophenetic correlation associated with the “average” linkage criteria.
# init a sequence spanning -1 to 1
beta <- seq(from=-1, to=1, length.out = 100)
# init an empty vector to store results
r <- vector(mode='numeric', length = length(beta))
# iterate over values of beta and compute cophenetic correlation
for(i in 1:length(r)) {
r[i] <- cor(d, cophenetic(agnes(d, method='gaverage', par.method = beta[i])))
}
# plot
par(mar=c(5,5,1,1))
plot(beta, r, type='l', xlab='beta parameter', ylab='cophenetic correlation', pch=16, las=1)
# locate the max coph. corr.
idx <- which.max(r)
# mark this point and label
points(beta[idx], r[idx], col='red', cex=2, lwd=2)
text(beta[idx], r[idx], labels = round(beta[idx], 3), pos=4)
The following example is rather basic. Many more possibilities are available in the dendextend package.
# load sample dataset from aqp package
data(sp3)
# promote to SoilProfileCollection
depths(sp3) <- id ~ top + bottom
# compute dissimilarity using different sets of variables
# note that these are rescaled to the interval [0,1]
d.1 <- profile_compare(sp3, vars=c('clay', 'L'), k=0, max_d=100, rescale.result=TRUE)
# cluster via divisive hierarchical algorithm
# convert to 'phylo' class
p.1 <- as.phylo(as.hclust(diana(d.1)))
# graphically compare diana() to agnes() using d.2
dueling.dendrograms(as.phylo(as.hclust(diana(d.1))),
as.phylo(as.hclust(agnes(d.1, method='average'))), lab.1='divisive', lab.2='agglomerative: average linkage')
The following creates simulated data for demonstration purposes, representing two populations: 1. mean = 0, sd = 0.3 2. mean = 1, sd = 0.3
# nice colors for later
col.set <- brewer.pal(9, 'Set1')
# 2D example
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
It is important to note that the k-means algorithm is sensitive to the initial selection of centroid locations (typically random). The default behavior of the kmeans() function does not attempt to correct for this limitation. Note that cluster assignment and centroid vary across runs (panels in the figure below).
par(mfrow=c(3,3), mar=c(1,1,1,1))
for(i in 1:9) {
cl <- kmeans(x, centers=3)
plot(x, col = col.set[cl$cluster], axes=FALSE)
grid()
points(cl$centers, col = col.set, pch = 8, cex = 2, lwd=2)
box()
}
Setting the nstart argument (“number of random starts”) to a value great than 1 (10 is ideal) will ensure that the final clustering configuration will remain stable between runs. Note that the cluster ID (color) will vary between runs, however, with nstart=10 the overal configuration remains the same.
par(mfrow=c(3,3), mar=c(1,1,1,1))
for(i in 1:9) {
cl <- kmeans(x, centers=3, nstart = 10, iter.max = 100)
plot(x, col = col.set[cl$cluster], axes=FALSE)
grid()
points(cl$centers, col = col.set, pch = 8, cex = 2, lwd=2)
box()
}
The cluster package provides two interfaces to the k-medoids algorithm:
pam(): small to medium sized data sets
clara(): optmized for larger data setsA quick example of using pam() to identify an increasing number of clusters.
par(mfrow=c(2,3), mar=c(1,1,1,1))
for(i in 2:7) {
cl <- pam(x, k = i, stand = TRUE)
plot(x, col = col.set[cl$clustering], axes=FALSE)
grid()
points(cl$medoids, col = col.set, pch = 0, cex = 2, lwd=2)
box()
}
Clustering results are in the form of class membership; values ranging between 0 and 1. This means that group membership is a continuum vs. the “hard” classes assigned by k-means or k-medoids. The “mixture” of class membership in the example below is conveniently expressed using proportions of red, green, and blue.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4.std <- data.frame(sp4[, c('id', 'name', 'top', 'bottom')], scale( sp4[, c('Mg', 'Ca')]))
# perform fuzzy clustering
cl <- fanny(sp4.std[, c('Mg', 'Ca')], k = 3, stand = FALSE)
# get membership matrix
m <- cl$membership
# convert to colors by interpreting membership as R,G,B proportions
cols <- rgb(m)
# setup plot
par(mar=c(4,4,0,0))
plot(sp4.std$Mg, sp4.std$Ca, asp=1, ylab='Exchangeable Mg (cmol/kg), Standardized', xlab='Exchangeable Ca (cmol/kg), Standardized', type='n')
abline(h=0, v=0, col='black')
grid()
# add original obs
points(sp4.std$Mg, sp4.std$Ca, bg=cols, col='black', cex=1.5, pch=21)
Save the RGB color representation of cluster membership to the source data.frame and convert to SoilProfileCollection.
sp4.std$colors <- cols
depths(sp4.std) <- id ~ top + bottom
par(mar=c(0,0,0,0))
plot(sp4.std, color='colors', cex.names=0.75)
title('Fuzzy Clustering Results in Context', line=-1)
There is no simple answer to the question “How many clusters are in my data?” Some metrics, however, can be used to help estimate a “reasonable” number of clusters. The mean silhouette width is a useful index of “cluster compactness” relative to neighbor clusters (Rousseeuw 1987). Larger silhouette widths suggest tighter grouping.
TODO: integrate the clustree package
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4.std <- data.frame(sp4[, c('id', 'name', 'top', 'bottom')], scale( sp4[, c('Mg', 'Ca')]))
# perform hard clustering
sil.widths <- vector(mode='numeric')
for(i in 2:10) {
cl <- pam(sp4.std[, c('Mg', 'Ca')], k = i, stand = FALSE)
sil.widths[i] <- cl$silinfo$avg.width
}
par(mar=c(4,4,3,1))
plot(sil.widths, type='b', xlab='Number of Clusters', ylab='Average Silhouette Width', las=1, lwd=2, col='RoyalBlue', cex=1.25, main='Finding the "Right" Number of Clusters')
grid()
According to this metric, it looks like 3 clusters is reasonable. Again, this is a judgement call–most decisions related to clustering algorithm selection and the “optimal number of clusters” are somewhat subjective.
# perform fuzzy clustering
cl <- pam(sp4.std[, c('Mg', 'Ca')], k = 3, stand = FALSE)
# setup plot
par(mar=c(4,4,0,0))
plot(sp4.std$Mg, sp4.std$Ca, asp=1, ylab='Exchangeable Mg (cmol/kg), Standardized', xlab='Exchangeable Ca (cmol/kg), Standardized', type='n')
abline(h=0, v=0, col='black')
grid()
# add original obs
points(sp4.std$Mg, sp4.std$Ca, bg=cl$clustering, col='black', cex=1.25, pch=21)
A simple, constrained ordination based on variance. This method does not use the distance matrix, rather it seeks to find a new set of axes that describe maximum variance via linear combinations of characteristics. Standardization is essential.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
sp4.scaled <- data.frame(name=sp4[, 1], round(scale( sp4[, -1]), 2))
# PCA
# note that we are leaving out the first column: the horizon names
# note the syntax used to extract the principal components
# note that PCA doesn't use the distance matrix
pca <- predict(princomp(sp4.scaled[, -1]))
## perform clustering to highlight structure in the PCA
# distance matrix
d <- dist(sp4.scaled[, -1])
m <- as.matrix(d)
dimnames(m) <- list(sp4.scaled$name, sp4.scaled$name)
d <- as.dist(m)
# dendrogram from divisive clustering
dd <- diana(d)
h <- as.hclust(dd)
p <- as.phylo(h)
# define colors based on cutting a divisive hierarchical clustering into 4 groups
cols <- brewer.pal(9, 'Set1')[cutree(h, 4)]
# plot first 2 PC
plot(pca[, 1:2], asp=1, type='n', axes=FALSE, xlab='', ylab='', main="Principal Components 1 and 2")
grid()
text(pca[, 1:2], sp4.scaled$name, cex=0.75, col=cols, font=2)
box()
Simple interface to nMDS, input is a distance matrix. Note that this algorithm will fail if there are 0’s or ties within the distance matrix. See ?sammon for details.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
sp4.scaled <- data.frame(name=sp4[, 1], round(scale( sp4[, -1]), 2))
# distance matrix
d <- dist(sp4.scaled[, -1])
m <- as.matrix(d)
dimnames(m) <- list(sp4.scaled$name, sp4.scaled$name)
d <- as.dist(m)
# dendrogram from divisive clustering
dd <- diana(d)
h <- as.hclust(dd)
p <- as.phylo(h)
# define colors based on cutting a divisive hierarchical clustering into 4 groups
cols <- brewer.pal(9, 'Set1')[cutree(h, 4)]
# nMDS from distance matrix
s <- sammon(d)
# plot
par(mar=c(3,1,3,1))
plot(s$points, asp=1, type='n', axes=FALSE, xlab='', ylab='', main="nMDS by Sammon's Non-Linear Mapping")
# abline(v=0, h=0, col='black')
grid()
text(s$points, rownames(s$points), cex=0.75, col=cols, font=2)
box()
vegan PackageThe following example is quite brief. See the Introduction to ordination in vegan vignette for some excellent worked examples and ecological interpretation. The vegan package FAQ is another excellent resource. Numerical Ecology with R can be used as both reference and learning resource.
The metaMDS() function from the vegan package provides a convenience function that automates most of the steps required to create an oridination.
# re-make data, this time with all profiles
data('sp4', package = 'aqp')
sp4 <- sp4[, c('name', 'clay', 'sand', 'Mg', 'Ca', 'CEC_7')]
sp4.scaled <- data.frame(name=sp4[, 1], round(scale( sp4[, -1]), 2))
# define colors based on natural groupings
cols <- brewer.pal(9, 'Set1')
# distance calc + ordination
s <- metaMDS(sp4.scaled[, -1], distance = 'gower', autotransform = FALSE, wascores=FALSE)
## this is used to generate 4 classes from a divisive hierarchical clustering
# manually compute distance matrix
d <- dist(sp4.scaled[, -1])
m <- as.matrix(d)
dimnames(m) <- list(sp4.scaled$name, sp4.scaled$name)
d <- as.dist(m)
# dendrogram from divisive clustering
dd <- diana(d)
h <- as.hclust(dd)
# define colors based on cutting a divisive hierarchical clustering into 4 groups
cols <- brewer.pal(9, 'Set1')[cutree(h, 4)]
# plot ordination
par(mar=c(3,1,3,1))
fig <- ordiplot(s, type='none', cex.axis=0.75, axes=FALSE, xlab='', ylab='', main='nMDS by metaMDS()')
abline(h=0, v=0, lty=2, col='grey')
text(fig$sites, sp4$name, cex=0.75, font=2, col=cols)
# ordicluster(fig, agnes(daisy(sp4.scaled[, -1]), method='ward'), prune=3, col = "orange")
box()
procrustes, etc.
# https://ncss-tech.github.io/AQP/aqp/color-contrast.html
Before you work through the following examples, you should review the SoilProfileCollection object tutorial.
# init example data
data(sp4)
depths(sp4) <- id ~ top + bottom
# eval dissimilarity:
# using Ex-Ca:Mg and CEC at pH 7
# with no depth-weighting (k=0)
# to a maximum depth of 40 cm
d <- profile_compare(sp4, vars=c('ex_Ca_to_Mg', 'CEC_7'), k=0, max_d=40)
# check distance matrix:
round(d, 1)
## Dissimilarities :
## colusa glenn kings mariposa mendocino napa san benito shasta shasta-trinity
## glenn 13.5
## kings 16.0 12.7
## mariposa 8.4 11.3 16.5
## mendocino 11.5 8.0 16.4 15.0
## napa 30.4 24.1 29.4 29.2 21.6
## san benito 25.7 20.6 26.3 28.2 15.8 18.0
## shasta 17.2 13.3 8.7 17.6 17.1 33.7 22.2
## shasta-trinity 6.4 16.6 22.3 9.6 16.5 29.8 27.2 23.3
## tehama 28.7 22.9 27.9 27.3 20.0 8.8 15.1 31.4 27.9
##
## Metric : mixed ; Types = I, I
## Number of objects : 10
# cluster via divisive method
clust <- diana(d)
# vizualize dissimilarity matrix via hierarchical clustering
par(mar=c(0,0,3,0))
plotProfileDendrogram(sp4, clust, dend.y.scale = max(d), scaling.factor = (1/max(d) * 10), y.offset = 2, width=0.15, cex.names=0.45, color='ex_Ca_to_Mg', col.label='Exchageable Ca to Mg Ratio')
The following are demonstrations of pair-wise distances computed from categorical data and the use of a dendrogram to organize groups from Soil Taxonomy. Click here for details.
# define a vector of series
s.list <- c('amador', 'redding', 'pentz', 'willows', 'pardee', 'yolo', 'hanford', 'cecil', 'sycamore', 'KLAMATH', 'MOGLIA', 'drummer', 'musick', 'zook', 'argonaut', 'PALAU')
# get and SPC object with basic data on these series
s <- fetchOSD(s.list)
# graphical check
par(mar=c(0,0,2,0))
plot(s) ; title('Selected Pedons from Official Series Descriptions', line=0)
# check structure of some site-level attributes
head(site(s))[, c('id', 'soilorder', 'suborder', 'greatgroup', 'subgroup')]
## id soilorder suborder greatgroup subgroup
## 1 AMADOR inceptisols xerepts haploxerepts typic haploxerepts
## 2 ARGONAUT alfisols xeralfs haploxeralfs mollic haploxeralfs
## 3 CECIL ultisols udults kanhapludults typic kanhapludults
## 4 DRUMMER mollisols aquolls endoaquolls typic endoaquolls
## 5 HANFORD entisols orthents xerorthents typic xerorthents
## 6 KLAMATH mollisols aquolls cryaquolls cumulic cryaquolls
par(mar=c(0,1,1,1))
# plot dendrogram + profiles
d <- SoilTaxonomyDendrogram(s, scaling.factor = 0.01)
Check the resulting distance matrix.
print(d)
Integrate related tutorial Explain.
# manually convert Munsell -> sRGB
rgb.data <- munsell2rgb(s$hue, s$value, s$chroma, return_triplets = TRUE)
s$r <- rgb.data$r
s$g <- rgb.data$g
s$b <- rgb.data$b
# eval color signature
pig <- soilColorSignature(s, RescaleLightnessBy = 5, method='depthSlices')
# display results as table
knitr::kable(head(pig), digits = 3, row.names = FALSE)
| id | A.0.1 | A.0.5 | A.0.9 | B.0.1 | B.0.5 | B.0.9 | L.0.1 | L.0.5 | L.0.9 |
|---|---|---|---|---|---|---|---|---|---|
| AMADOR | 3.681 | 4.870 | 1.919 | 13.298 | 18.956 | 27.828 | 8.250 | 10.324 | 14.333 |
| ARGONAUT | 13.549 | 19.311 | 15.876 | 19.548 | 30.215 | 17.613 | 6.159 | 6.161 | 8.250 |
| CECIL | 7.378 | 32.895 | 28.722 | 25.999 | 29.497 | 35.807 | 8.254 | 8.251 | 8.254 |
| DRUMMER | 2.189 | 1.478 | 1.478 | 5.367 | 6.176 | 6.176 | 4.111 | 8.247 | 8.247 |
| HANFORD | 5.629 | 5.629 | 6.611 | 19.847 | 19.847 | 25.480 | 8.252 | 8.252 | 10.327 |
| KLAMATH | 2.189 | 2.189 | 10.371 | 5.367 | 5.367 | 37.715 | 4.111 | 4.111 | 8.257 |
# move row names over for distance matrix
row.names(pig) <- pig[, 1]
d <- daisy(pig[, -1])
dd <- diana(d)
# plot
par(mar=c(0,0,0.25,1))
plotProfileDendrogram(s, dd, dend.y.scale = max(d) * 2, scaling.factor = 0.3, y.offset = 6, width=0.2, cex.names=0.45)
Just for fun, use hierarchical clustering and nMDS on soil color data from the OSDs that were used in the previous example.
# extract horizon data from select OSDs in above example
h <- horizons(s)
# convert Munsell color notation to sRGB
# these are moist colors
rgb.data <- munsell2rgb(h$hue, h$value, h$chroma, return_triplets = TRUE)
lab.data <- munsell2rgb(h$hue, h$value, h$chroma, returnLAB = TRUE)
# check
head(rgb.data)
## r g b
## 1 0.4360624 0.3706674 0.29697452
## 2 0.5589675 0.4673350 0.35663875
## 3 0.5589675 0.4673350 0.35663875
## 4 0.7719679 0.6774631 0.48997537
## 5 0.3940324 0.2499977 0.16682669
## 6 0.4309729 0.2327690 0.09771028
head(lab.data)
## L A B
## 1 41.24855 3.681301 13.29762
## 2 51.62124 4.870262 18.95563
## 3 51.62124 4.870262 18.95563
## 4 71.66388 1.919454 27.82850
## 5 30.79580 13.548688 19.54789
## 6 30.80674 19.311423 30.21539
# remove NA
rgb.data <- na.omit(rgb.data)
lab.data <- na.omit(lab.data)
# retain unique colors
rgb.data <- unique(rgb.data)
lab.data <- unique(lab.data)
# visualize colors in LAB coordinates
pairs(lab.data, col='white', bg=rgb(rgb.data), pch=21, cex=2)
# create distance matrix from LAB coordinates
d <- daisy(lab.data, stand=FALSE)
# divisive heirarcical clustering
d.hclust <- as.hclust(diana(d))
# convert to phylo class for nicer plotting
p <- as.phylo(d.hclust)
# perform nMDS on distance matrix
d.sammon <- sammon(d)
# setup multi-figure page
par(mfcol=c(1,2), mar=c(0,0,2,0), bg=grey(0.95))
# plot fan-style dendrogram
plot(p, font=2, cex=0.5, type='fan', show.tip.label=FALSE, main='Dendrogram Representation')
# add colors at dendrogram tips
tiplabels(pch=21, cex=4, col='white', bg=rgb(rgb.data))
# plot nMDS ordination
plot(d.sammon$points, type='n', axes=FALSE, xlab='', ylab='', asp=1, main='nMDS Ordination')
abline(h=0, v=0, col='black', lty=3)
points(d.sammon$points, bg=rgb(rgb.data), pch=21, cex=3.5, col='white')
Example borrowed from this tutorial.
library(reshape2)
# set list of component names, same as soil color example
s.list <- c('amador', 'redding', 'pentz', 'willows', 'pardee', 'yolo',
'hanford', 'cecil', 'sycamore', 'KLAMATH', 'MOGLIA', 'drummer',
'musick', 'zook', 'argonaut', 'PALAU')
# set list of relevant interpretations
interp.list <- c('ENG - Construction Materials; Topsoil',
'ENG - Sewage Lagoons', 'ENG - Septic Tank Absorption Fields',
'ENG - Unpaved Local Roads and Streets')
# compose query
q <- paste0("SELECT UPPER(compname) as compname, mrulename, AVG(interplr) as interplr_mean
FROM component INNER JOIN cointerp ON component.cokey = cointerp.cokey
WHERE compname IN ", format_SQL_in_statement(s.list), "
AND seqnum = 0
AND mrulename IN ", format_SQL_in_statement(interp.list), "
AND interplr IS NOT NULL
GROUP BY compname, mrulename;")
# send query
x <- SDA_query(q)
# reshape long -> wide
x.wide <- dcast(x, compname ~ mrulename, value.var = 'interplr_mean')
knitr::kable(x.wide, digits = 3, caption="Mean Fuzzy Ratings for Select Soil Series")
| compname | ENG - Construction Materials; Topsoil | ENG - Septic Tank Absorption Fields | ENG - Sewage Lagoons | ENG - Unpaved Local Roads and Streets |
|---|---|---|---|---|
| AMADOR | 0.000 | 1.000 | 1.000 | 0.691 |
| ARGONAUT | 0.050 | 1.000 | 1.000 | 1.000 |
| CECIL | 0.416 | 0.663 | 0.847 | 0.297 |
| DRUMMER | 0.000 | 1.000 | 1.000 | 1.000 |
| HANFORD | 0.678 | 0.989 | 1.000 | 0.220 |
| KLAMATH | 0.000 | 1.000 | 1.000 | 1.000 |
| MOGLIA | 0.000 | 1.000 | 0.500 | 1.000 |
| MUSICK | 0.245 | 1.000 | 0.963 | 0.832 |
| PALAU | 0.011 | 1.000 | 0.864 | 1.000 |
| PARDEE | 0.000 | 1.000 | 1.000 | 1.000 |
| PENTZ | 0.003 | 1.000 | 1.000 | 0.714 |
| REDDING | 0.022 | 1.000 | 1.000 | 0.907 |
| SYCAMORE | 0.824 | 0.998 | 0.756 | 0.919 |
| WILLOWS | 0.000 | 1.000 | 0.947 | 1.000 |
| YOLO | 0.843 | 0.914 | 0.608 | 0.769 |
| ZOOK | 0.004 | 1.000 | 1.000 | 1.000 |
# note: component name and series name have been converted to upper case
# sort rows of fuzzy ratings based on profiles from OSDs
new.order <- match(x.wide$compname, profile_id(s))
x.wide <- x.wide[new.order, ]
# copy ids to row.names so that they are preserved in distance matrix
row.names(x.wide) <- x.wide$compname
# create distance matrix
d <- daisy(x.wide[, -1])
# divisive hierarchical clustering
clust <- diana(d)
par(mar=c(2,0,2,0))
plotProfileDendrogram(s, clust, dend.y.scale = 1.5, scaling.factor = 0.004, y.offset = 0.1, width=0.25, cex.names=0.45)
title('Component Similarity via Select Fuzzy Ratings')
mtext('Profile Sketches are from OSDs', 1)
This example generates an ordination of environmental variables (PRSIM and elevation) associated with MLRAs 15, 18, 22A, and 22B. The contours contain 75% (dotted line), 50% (dashed line), and 25% (solid line) of the points. See comments in the code for details. Note that this example was extracted from the Region 2 Map Unit Comparison Report.
library(MASS)
library(vegan)
library(cluster)
library(RColorBrewer)
# get example data
# init a temp file
tf <- tempfile()
# download compressed CSV to temp file
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/clustering_and_ordination/MLRA-raster-data-example.csv.gz', destfile = tf, quiet = TRUE)
# read-in from compressed CSV to data.frame object
d.sub <- read.csv(gzfile(tf), stringsAsFactors = FALSE)
# check: note that the column names are quite strange
head(d.sub)
# set factor levels
mu.set <- c('15', '18', '22A', '22B')
d.sub$.id <- factor(d.sub$.id, levels = mu.set)
# define some nice colors
cols <- brewer.pal(9, 'Set1')
# remove light colors
cols <- cols[c(1:5,7,9)]
# http://stackoverflow.com/questions/16225530/contours-of-percentiles-on-level-plot
kdeContours <- function(i, prob, cols, m, ...) {
# need more than 2 rows of data
if(nrow(i) < 2) {
return(NULL)
}
# keep track of colors
this.id <- unique(i$.id)
this.col <- cols[match(this.id, m)]
# estimate density
dens <- kde2d(i$x, i$y, n=200)
# differentiate to estimate quantiles
dx <- diff(dens$x[1:2])
dy <- diff(dens$y[1:2])
sz <- sort(dens$z)
c1 <- cumsum(sz) * dx * dy
levels <- sapply(prob, function(x) {
approx(c1, sz, xout = 1 - x)$y
})
# add contours if possible
if(!is.na(levels))
contour(dens, levels=levels, drawlabels=FALSE, add=TRUE, col=this.col, ...)
}
## NOTE: data with very low variability will cause warnings
# eval numerical distance, removing first 3 columns of IDs
d.dist <- daisy(d.sub[, -c(1:3)], stand=TRUE)
## map distance matrix to 2D space via principal coordinates
d.betadisper <- betadisper(d.dist, group=d.sub$.id, bias.adjust = TRUE, sqrt.dist = TRUE, type='median')
d.scores <- scores(d.betadisper)
# split data into a list by ID
s <- data.frame(x=d.scores$sites[, 1], y=d.scores$sites[, 2], .id=d.sub$.id)
s <- split(s, s$.id)
# plot
par(mar=c(1,1,3,1))
plot(d.scores$sites, type='n', axes=FALSE)
abline(h=0, v=0, lty=2, col='grey')
# add contours of prob density
# NOTE: lines are not added if data are too densely spaced for evaluation of requested prob. level
res <- lapply(s, kdeContours, prob=c(0.75), cols=cols, m=levels(d.sub$.id), lwd=1, lty=3)
res <- lapply(s, kdeContours, prob=c(0.5), cols=cols, m=levels(d.sub$.id), lwd=1, lty=2)
res <- lapply(s, kdeContours, prob=c(0.25), cols=cols, m=levels(d.sub$.id), lwd=2, lty=1)
# add sample points
points(d.scores$sites, cex=0.45, col=cols[as.numeric(d.sub$.id)], pch=16)
# note special indexing for cases when low-var MU have been removed
ordilabel(d.betadisper, display='centroids', col=cols[match(mu.set, levels(d.sub$.id))])
title('Ordination of Raster Samples (cLHS Subset) with 25%, 50%, 75% Density Contours')
box()
Thoughts on interpretation:
The following is an investigation of topographic “signatures” from the initial soil mapping project of Sequoia and Kings Canyon National Park, CA.
library(plyr)
library(reshape2)
library(vegan)
library(cluster)
# init a temp file
tf <- tempfile()
# download compressed CSV to temp file
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/clustering_and_ordination/seki-mu-gis-samples.csv.gz', destfile = tf, quiet = TRUE)
# read-in from compressed CSV to data.frame object
x <- read.csv(gzfile(tf), stringsAsFactors = FALSE)
# check
head(x)
# get terrain variable names
vars <- c('elev', 'solar', 'slope', 'tci', 'ppt', 'maat', 'tpi', 'tri', 'pi')
# covnvert wide to long format for summary by MY symbol
x.long <- melt(x, id.vars = 'MUSYM', measure.vars = vars)
x.summary <- ddply(x.long, c('MUSYM', 'variable'), .fun=plyr::summarize, m=median(value, na.rm = TRUE))
head(x.summary)
# retain median values, convert back to wide format
x.wide <- dcast(x.summary, MUSYM ~ variable, value.var = 'm')
row.names(x.wide) <- x.wide$MUSYM
head(x.wide, 2)
# generate lookup table of MUSYM - MU groups
lut <- unique(x[, c('MUSYM', 'gensym')])
# join data matrix with lut
x.wide <- join(x.wide, lut, by='MUSYM', type='left', match='first')
x.wide$gensym <- factor(x.wide$gensym, levels = c("riparian", "xx10", "xx20", "xx30", "xx40", "xx50"), labels = c("riparian", "basins and cirques", "glacial valleys", "bedrock controlled mountain slopes and valley walls", "non bedrock controlled slopes", "plateaus, till plains, outwash plains"))
# perform non-metric MDS
nmds <- metaMDS(x.wide[, vars], distance = 'gower', autotransform = FALSE, wascores=FALSE)
# setup colors
cols <- c('RoyalBlue', 'black', 'DarkGreen', 'orange3', 'red', 'brown')
site.colors <- cols[as.numeric(x.wide$gensym)]
# plot ordination
par(mar=c(1,1,1,1))
fig <- ordiplot(nmds, type='none', cex.axis=0.75, axes=FALSE)
abline(h=0, v=0, lty=2, col='grey')
text(fig, "sites", cex=0.65, font=2, col=site.colors)
legend('bottomleft', legend=levels(x.wide$gensym), col=cols, pch=15, pt.cex=2, bty='n', cex=1)
title('Map Unit Terrain Signatures')
This document is based on aqp version 1.19.01 and soilDB version 2.5.1 and sharpshootR version 1.6. |
Arkley, Rodney J. 1976. “Statistical Methods in Soil Classification Research.” In Advances in Agronomy, edited by N. C. Brady, 37–69. New York, NY: Academic Press.
Belbin, Lee, Daniel P. Faith, and Glenn W. Milligan. 1992. “A Comparison of Two Approaches to Beta-Flexible Clustering.” Multivariate Behavioral Research 27 (3). Routledge: 417–33. https://doi.org/10.1207/s15327906mbr2703\_6.
Gower, J. C. 1971. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics 27 (4). International Biometric Society: 857–71. http://www.jstor.org/stable/2528823.
Hole, F. D., and M. Hironaka. 1960. “An Experiment in Ordination of Some Soil Profiles.” Proceedings of the Soil Science Society of America 24 24: 309–12.
Kaufman, Leonard, and Peter J. Rousseeuw. 2005. Finding Groups in Data an Introduction to Cluster Analysis. Wiley-Interscience.
Legendre, P., and L. Legendre. 1998. Numerical Ecology. 2nd ed. Developments in Environmental Modeling 20. Amsterdam: Elsevier.
Rousseeuw, P.J. 1987. “Silhouettes: A Grapical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathmatics 20: 53–65.
Sneath, Peter H. A., and Robert R. Sokal. 1973. Numerical Taxonomy. W.H. Freeman; Company.